THE OFFICE - TEXT ANALYSIS
This project’s main purpose is to analyze a TV show in a reliable, measurable way, without having to watch the whole show or rely on a personal perspective. The subject selected for this analysis is the sitcom ‘The Office’, chosen mainly for the high availability of data.
This notebook uses the data previously collected and cleaned to go through the analysis process.
The non-standard libraries used in this notebook are:
Pandas
for data wrangling;
NumPy, scikit-learn, SciPy, and spaCy
for mathematical, statistical and machine-learning related tasks;
NetworkX, Matplotlib, and Seaborn
for visualizations;
# install required libraries
import sys
# !{sys.executable} -m pip install numpy
# !{sys.executable} -m pip install pandas
# !{sys.executable} -m pip install scipy
# !{sys.executable} -m pip install scikit-learn
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m pip install matplotlib
# !{sys.executable} -m pip install seaborn
# !{sys.executable} -m pip install networkx
import json
import scipy
import spacy
import pandas as pd
import numpy as np
import networkx as nx
import seaborn as sb
import matplotlib.pyplot as plt
from math import pi
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.interpolate import make_interp_spline, BSpline
from IPython.display import display, HTML  # IPython.core.display is deprecated
# download model for spaCy
# python -m spacy download en_core_web_sm
import en_core_web_sm
# configure backend to increase the visualizations resolution
%config InlineBackend.figure_format ='retina'
# load data
df = pd.read_csv('the_office_features.csv', sep=';', encoding='utf-16')
ratings = pd.read_csv('ratings.csv', sep=';', encoding='utf-16')
print('Main dataset, first 5 rows:')
df.head()
The first step in analyzing the show is to define who the main characters are.
This is open to several interpretations, but for this project we use a few metrics to decide. The points considered are:
The number of dialogs and episodes is the main indicator of who the main characters are, since the characters with the largest share of dialog who appear in the most episodes receive the most attention and should therefore be the main characters.
There's a challenge here because of special guests and characters who were very important for a short time. These characters have lots of dialogs and appear in many episodes, but they're only around for a couple of seasons at most; this is why the number of seasons is also factored into the main-character score.
To solve this issue I developed a score that combines all of the above-mentioned points.
We start by aggregating the numerical fields and getting their descriptive statistics, such as means, standard deviations, medians, and other aggregations.
# Build a new data frame with aggregated measures
def build_df(temp):
temp_describe = temp.describe()
chars = temp_describe.index.to_list()
new_df = pd.DataFrame(chars)
new_df.columns = ['chars']
# count of dialogs
new_df['dialogs'] = temp_describe['id']['count'].values
# Words
new_df['avg_words'] = temp_describe['words_qty']['mean'].values
new_df['std_words'] = temp_describe['words_qty']['std'].values
new_df['25%_median_words'] = temp_describe['words_qty']['25%'].values
new_df['50%_median_words'] = temp_describe['words_qty']['50%'].values
new_df['75%_median_words'] = temp_describe['words_qty']['75%'].values
# Sentences
new_df['avg_sentences'] = temp_describe['sentences_qty']['mean'].values
new_df['std_sentences'] = temp_describe['sentences_qty']['std'].values
# Sentiment analysis
new_df['positive'] = temp_describe['positive']['mean'].values
new_df['neutral'] = temp_describe['neutral']['mean'].values
new_df['negative'] = temp_describe['negative']['mean'].values
new_df['compound'] = temp_describe['compound']['mean'].values
# total words, number of seasons and number of episodes
new_df['total_words'] = temp.sum()['words_qty'].values
new_df['unique_s'] = temp['season'].nunique().values
new_df['unique_ep'] = temp['ep_seas'].nunique().values
return new_df
# Similar to the previous method, for building a single dataframe per character
def build_char_df(name):
temp = df[df['name'] == name].groupby(by='episode_name')
char_df = pd.DataFrame(temp.count().index)
char_df.columns = ['ep_name']
char_df['dialogs'] = temp.count()['text'].values
char_df['mean_sent'] = temp.mean()['sentences_qty'].values
char_df['mean_words'] = temp.mean()['words_qty'].values
char_df['mean_positive'] = temp.mean()['positive'].values
char_df['mean_negative'] = temp.mean()['negative'].values
char_df['mean_neutral'] = temp.mean()['neutral'].values
char_df['mean_compound'] = temp.mean()['compound'].values
char_df['total_sent'] = temp.sum()['sentences_qty'].values
char_df['total_words'] = temp.sum()['words_qty'].values
ratings['ep_name'] = [x.upper() for x in ratings['ep_name']]
char_df = char_df.merge(ratings, how='right',left_on = 'ep_name', right_on = 'ep_name')
char_df.drop(['Unnamed: 0'], axis=1, inplace=True)
char_df = char_df.sort_values(by=(['season', 'ep_num'])).fillna(0)
return char_df
This is the indicator I developed to find the main characters of the series and rank them by relevance to the show.
Score = nep + (nd / nep) * (ns / 5)
nep = number of episodes;
nd = number of dialogs;
ns = number of seasons;
5 is a threshold I used.
The idea is to "penalize" characters that appeared in fewer than 5 seasons (approximately half the series) and give more weight to characters that appeared in more than 5 seasons.
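As a sanity check, the score can be sketched with hypothetical numbers (the characters and counts below are made up for illustration): a long-running regular scores well, while a one-season guest with many dialogs is penalized by the seasons factor.

```python
# Sketch of the relevance score with hypothetical inputs.
def relevance_score(n_episodes, n_dialogs, n_seasons, threshold=5):
    # nep + (nd / nep) * (ns / threshold)
    return n_episodes + (n_dialogs / n_episodes) * (n_seasons / threshold)

# A regular in 150 episodes across 9 seasons vs. a guest in 20 episodes of 1 season
regular = relevance_score(n_episodes=150, n_dialogs=3000, n_seasons=9)
guest = relevance_score(n_episodes=20, n_dialogs=600, n_seasons=1)
print(regular)  # 150 + 20 * 1.8 = 186.0
print(guest)    # 20 + 30 * 0.2 = 26.0
```

Even though the guest averages more dialogs per episode, the low season count keeps the score small.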
temp = df.groupby('name')
top_chars = build_df(temp)
score = top_chars['unique_ep']+(top_chars['dialogs']/top_chars['unique_ep'])*(top_chars['unique_s']/5)
top_chars['score'] = score
all_chars = top_chars.sort_values(by='score', ascending=False)[15:]
top_chars = top_chars.sort_values(by='score', ascending=False)[:15]
top_chars
Main characters according to the score are:
top_chars.chars.unique().tolist()
#top_chars.to_csv('the_office_main_chars.csv', sep=';', encoding='utf-16', index=False)
#plot the main characters and their scores
top_chars = top_chars.sort_values(by='score', ascending=True)
fig, ax = plt.subplots(1,figsize=(16,6))
plt.barh(top_chars['chars'], top_chars['score'])
plt.title('Most relevant characters')
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_axisbelow(True)
ax.grid(axis='x', linestyle='--')
top_chars = top_chars.sort_values(by='score', ascending=False)
top_chars[['chars','unique_ep', 'dialogs', 'unique_s', 'score']]
To test the score we can compare the distributions of the selected variables.
The chart below compares the main characters (red), selected by the score, with all the other characters (blue).
fig, ax = plt.subplots(1, figsize=(16,8))
plt.scatter(all_chars.unique_ep, all_chars.dialogs, linewidths = all_chars.unique_s, label = 'All Characters')
plt.scatter(top_chars.unique_ep, top_chars.dialogs, linewidths = top_chars.unique_s, label = 'Main Characters', color = 'red')
plt.title('Characters Dialogs and Number of Episodes')
plt.legend(['All Characters', 'Main Characters'])
plt.xlabel('Episodes')
plt.ylabel('Dialogs')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
We can also see how the score handles the average number of dialogs per episode, shown in the chart below.
fig, ax = plt.subplots(1, figsize=(16,8))
plt.scatter(all_chars.dialogs/all_chars.unique_ep, all_chars.dialogs, linewidths = all_chars.unique_s, label = 'All Characters')
plt.scatter(top_chars.dialogs/top_chars.unique_ep, top_chars.dialogs, linewidths = top_chars.unique_s, label = 'Main Characters', color = 'red')
plt.title('Characters Dialogs, Totals and Averages')
plt.legend(['All Characters', 'Main Characters'])
plt.xlabel('Average Dialogs')
plt.ylabel('Total Dialogs')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
A very interesting characteristic to analyze is the number of words and sentences a character says: characters with high averages are the ones with a lot to say; they don't just react to situations, they have something to add.
Loosely speaking, subjectivity is the gap between how many words you say and how much you actually communicate, so using many words to convey a small message suggests a lot of subjectivity.
In the chart below, the blue bars represent the means and the black lines the standard deviations. The problem here is the huge gap between the means and the standard deviations: the data has extreme outliers, so the averages are not a good indication of who talks more or less; they only give a rough idea.
top_chars = top_chars.sort_values(by='avg_words', ascending=True)
fig, ax = plt.subplots(1,figsize=(16,6))
plt.barh(top_chars['chars'], top_chars['avg_words'], xerr=top_chars['std_words'])
plt.xlim([0,31])
plt.title('Average words in dialog')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.set_axisbelow(True)
ax.grid(axis='x', linestyle='--')
Since the means don't give us the full picture, we can analyse the medians for those characters.
top_chars = top_chars.sort_values(by='50%_median_words', ascending=True)
fig, ax = plt.subplots(1,figsize=(16,6))
plt.bar(top_chars['chars'], top_chars['50%_median_words'])
plt.title('Median words in a dialog')
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_axisbelow(True)
ax.grid(axis='y', linestyle='--')
The medians are more interesting: one character in particular stands apart from everyone else. Among all the main characters, Kevin has the lowest median number of words per dialog.
And we can easily find evidence of that.
https://www.youtube.com/watch?v=_K-L9uhsBLM
To analyse the overall sentiment polarity of the show and its main characters we're using VADER; please consult the data cleaning and preparation notebook for more information about this method and its implementation.
The visualization of the results aims at displaying the characters of the show, and the average positive and negative sentiments for each of them.
For a fair comparison, both values are plotted on the same scale: the range 0.04 to 0.23 is applied to both the x and y axes.
fig, ax = plt.subplots(1,figsize=(8,8))
plt.scatter(top_chars['positive'], top_chars['negative'], marker='x')
ax.axhline(0.14, linestyle='--', color='grey')
ax.axvline(0.14, linestyle='--', color='grey')
plt.xlim([0.04, 0.23])
plt.ylim([0.04, 0.23])
plt.xlabel('Positive')
plt.ylabel('Negative')
for i, name in enumerate(top_chars['chars'].values):
if(name in ['Angela', 'Stanley', 'Meredith']):
ax.annotate(name, (top_chars['positive'].values[i], top_chars['negative'].values[i]))
We can see that most characters behave similarly in terms of dialog polarity: the values concentrate at high positive and low negative scores for the vast majority, but there are also some outliers away from the group.
As mentioned before, most characters have a high positive score of around 0.14 to 0.20 and a low negative score of 0.06 to 0.08.
But some characters have higher negative scores, and one character has a notably lower positive score.
Stanley is the most distant from the other characters: his positive score is relatively low, but his negative score isn't particularly high either.
This means his dialogs are mostly neutral, almost as if he doesn't want to get involved. https://www.youtube.com/watch?v=iahcJPo9Dwg
fields = ['chars', 'dialogs', 'avg_words', 'positive', 'neutral',
'negative', 'compound', 'unique_s', 'unique_ep', 'score']
top_chars[top_chars['chars'].isin(['Angela', 'Stanley', 'Meredith'])][fields]
The file 'conversations.json' contains one record for every scene in the show; each record holds the names of the characters who had dialog in the scene and the number of dialogs each character had.
These conversations will be used to calculate a score for the relations between the characters.
file = open('conversations.json')
conversations = file.read()
conversations = json.loads(conversations)
print('first 5 rows:')
conversations[:5]
In order to compare the relationship between the characters the following formula was developed:
$\sum \min(n_x, n_y) / \max(n_x, n_y)$
Where:
$n_x$ = number of dialogs character x had in a conversation;
$n_y$ = number of dialogs character y had in a conversation;
This score is based on the concept that a perfectly balanced conversation will have the same amount of dialogs between both agents.
E.g.: a conversation with three characters x, y, and z,
where x said 5 dialogs, y said 5 dialogs, and z said 1 dialog, will result in a score of 1 between x and y, and 0.2 between x and z.
The scores are then summed over all scenes for the same pair so they can be compared. It's important to note that this results in generally higher scores for characters that communicate a lot and lower scores for characters that don't.
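The per-scene pair score from the example above can be sketched as follows (x, y, and z are the hypothetical characters from the text):

```python
# Balance score for one scene: 1.0 for a perfectly balanced pair,
# approaching 0 as one character dominates the conversation.
def pair_score(nx, ny):
    return min(nx, ny) / max(nx, ny)

scene = {'x': 5, 'y': 5, 'z': 1}
print(pair_score(scene['x'], scene['y']))  # 1.0
print(pair_score(scene['x'], scene['z']))  # 0.2
```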
# relationship score calculation
def calc_score(a, b):
    return min(a, b) / max(a, b)
scores = []
names = []
for name_A in top_chars['chars'].unique():
for name_B in top_chars['chars'].unique():
score = 0
if name_A == name_B:
continue
#print(name_A, name_B)
for talk in conversations:
if name_A in talk and name_B in talk:
score += calc_score(talk[name_A], talk[name_B])
scores.append(score)
names.append([name_A, name_B])
# build a dataframe with the main character names
df_rel = pd.DataFrame(top_chars['chars'].unique())
df_rel.columns = ['names']
# fill the dataframe with 0s
for name in top_chars['chars'].unique():
df_rel[name] = np.zeros(len(top_chars['chars'].unique()))
# set name as index
df_rel.set_index('names', inplace = True)
# store scores to their respective rows
for i, n in enumerate(names):
    df_rel.loc[n[1], n[0]] = scores[i]  # .loc avoids deprecated chained assignment
After calculating the relationship scores for every character of the show we have the following data:
df_rel
At this point we'll start comparing the relationships and describing them as 'strong' or 'weak', depending on their scores. It's important to note that a strong relationship in this context doesn't reflect the sentiment between the characters, so it isn't necessarily a positive relation.
In this context, a strong relationship means the characters communicate a lot.
mask = []
for i in np.arange(len(top_chars['chars'].unique())):
mask.append(np.concatenate( (np.ones( i+1, dtype=bool),
np.zeros( len(top_chars['chars'].unique())-i-1, dtype=bool))).tolist())
fig, ax = plt.subplots(1,figsize=(16,8))
sb.heatmap(df_rel, annot=True, fmt="g", cmap='RdBu', mask=mask, vmin = 0, vmax = 500)
plt.show()
By themselves the scores are already very meaningful: we can tell that Pam and Jim have the strongest relationship of all.
We can also notice that Michael, the main character of the show, has an overall higher score with everybody when compared to 'lower-ranked' main characters such as Meredith, Creed, or Darryl.
This makes sense from the perspective that Michael has been communicating more constantly with everybody in the show, so he probably has a stronger relationship with most characters.
To extract even more information about the relationships we can normalize the scores. In this case we'll do so by standardizing the values, i.e., calculating their z-scores. This shows how many standard deviations away from the mean each relation is.
Put simply, we want to see how extreme those relationships are for each character.
# Z-Scores
df_nor=(df_rel-df_rel.mean())/df_rel.std()
mask = []
for i in np.arange(len(top_chars['chars'].unique())):
mask.append(np.concatenate( (np.zeros( i, dtype=bool),
np.array([1], dtype=bool),
np.zeros( len(top_chars['chars'].unique())-i-1, dtype=bool))).tolist())
fig, ax = plt.subplots(1,figsize=(22,8))
sb.heatmap(df_nor, annot=True, fmt="g", cmap='RdBu', mask=mask)
ax.invert_yaxis()
ax.xaxis.tick_top()
plt.show()
One way to improve this visualization is to show the actual p-values, which represent how likely it would be to observe a value at least that extreme under the distribution.
In this case, we'll look for relationships with a p-value lower than 0.05, corresponding to a 95% confidence level that those relationships differ significantly from the average relationships of the analysed characters.
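Assuming the standardized scores are approximately normal, the conversion from a z-score to a two-sided p-value can be illustrated with a few sample values; a |z| of about 1.96 corresponds to the p = 0.05 threshold.

```python
from scipy.stats import norm

# Two-sided p-value: probability of a value at least this many
# standard deviations from the mean, in either direction.
for z in (1.0, 1.96, 3.0):
    p = norm.sf(abs(z)) * 2
    print(z, round(p, 4))
```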
df_p_values = scipy.stats.norm.sf(abs(df_nor))*2
fig, ax = plt.subplots(1,figsize=(22,8))
sb.heatmap(df_p_values, annot=True, fmt="g", cmap='Blues_r', mask=mask)
ax.invert_yaxis()
ax.xaxis.tick_top()
plt.xticks(np.arange(0.5,15.5),df_rel.columns)
plt.yticks(np.arange(0.5,15.5),df_rel.columns, rotation=0.9)
plt.show()
With 95% confidence, the relationships listed below had a higher conversation score than the average relationship.
Michael -> Dwight
Dwight -> Michael
Dwight -> Jim
Jim -> Dwight
Jim -> Pam
Pam -> Jim
Angela -> Dwight
Andy -> Dwight
Darryl -> Andy
Ryan -> Michael
Stanley -> Phyllis
# build dictionary with lists of 'from' and 'to', for plotting the network graphs
from_to = {'from':['Michael','Dwight','Dwight','Jim','Jim',
'Pam','Angela','Andy','Darryl','Stanley','Ryan'],
'to':['Dwight', 'Michael', 'Jim', 'Dwight', 'Pam', 'Jim',
'Dwight', 'Dwight', 'Andy', 'Phyllis', 'Michael']}
# build a data frame from the dictionary
df_net = pd.DataFrame(from_to)
Visualize the strongest relationships in a network chart
# plot network chart
fig, ax = plt.subplots(1,figsize=(16,8))
G=nx.from_pandas_edgelist(df_net, 'from', 'to', create_using=nx.DiGraph() )
nx.draw(G, with_labels=True, node_size=3500, alpha=0.5, arrows=True,
linewidths=1, font_size=15, pos=nx.circular_layout(G))
plt.title("Relationships");
Word and term frequency can give us an interesting perspective on how the characters communicate and what the show is about.
To start, we can visualize the show's most frequent words in a word cloud; to do that we're using a bag-of-words approach that selects and displays the words and terms with the highest frequency.
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
char_text = ' '.join(df.clean_txt.astype(str).values)
wordcloud = WordCloud(width=1800, height=1800, background_color ='#293F3F', colormap = "Reds",
max_font_size = 250).generate( char_text.upper())
fig = plt.figure(figsize = (15, 15), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
We can see in the above visualization that many of the words relate to people: names and pronouns are very common in their daily communications. We can also see that many of those words have little to no meaning by themselves.
To improve on that we can check what are the distinguishable terms spoken by the characters, in other words, we'll remove words that are common to all characters and focus on the words that are specific to each of the main characters.
Term Frequency - Inverse Document Frequency (TF-IDF) is a method that weighs how many times a term appears in a document against how many documents the term appears in.
TF-IDF( t, d ) = Term Frequency( t, d ) * Inverse Document Frequency( t )
t = term
d = document
We then take the difference between each character's score and the mean score across all characters; this shows how far above or below average each word is for that character.
The result is sorted to get the most above-average words for each character.
# Prepare a dataframe
# get texts by character
df_txt = df[df.name.isin(top_chars.chars.values)]
# group the texts for each character in a single string
all_txt = []
for char in df_txt.name.unique():
temp_df = df_txt[df_txt.name == char]
temp_txt = []
for i, row in temp_df.iterrows():
temp_txt.append(str(row.clean_txt))
all_txt.append(' '.join(temp_txt))
# Create a dataframe then add main characters and texts to it
df_txt = pd.DataFrame(df_txt.name.unique())
df_txt.columns = ['name']
df_txt['text'] = all_txt
# use a vectorizer to build a sparse matrix that'll hold the word counts for each character
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df_txt.text)
# Test by checking how many times the word 'business' was said on the show
# (vocabulary_ maps a word to its column index; sum that column for the count)
idx = count_vect.vocabulary_.get(u'business')
test = X_train_counts[:, idx].sum()
print('*test*\nHow many times the word business appears: '+str(test))
# calculate tfidf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# convert matrix to array and, get feature names from vector
df_words = pd.DataFrame(X_train_tfidf.toarray(), columns = count_vect.get_feature_names_out())
# build data frame
df_words['word'] = df_txt.name.unique()
df_words.set_index('word', inplace = True)
df_words = df_words.transpose()
df_words['sum_score'] = df_words.sum(axis=1)
# divide the sum by the 15 main characters (summing axis=1 again would include sum_score)
df_words['mean_score'] = df_words['sum_score']/15
print('last 5 rows:')
df_words.tail()
The 10 most distinct words by character
# Build a dataframes with the top 10 more distinguishable words for each character
df_tfidf = pd.DataFrame(np.arange(1,11))
for name in top_chars.chars.values:
temp = (df_words[name]-df_words['mean_score']).sort_values(ascending = False)[:10]
df_tfidf[name] = temp.index.tolist()
df_tfidf[name+'_values'] = temp.values
df_tfidf[top_chars.chars.values]
One of the many ways to break down all this data is to analyse the characters individually; from this point on, the previously discussed methods are adapted to a single character.
Besides the previously seen data, in this section we'll also explore the ratings.
# Define the character to be analysed
name = 'Michael'
myplot = build_char_df(name)
print('The options are:')
print(top_chars['chars'].values)
print('\nSelected character is: '+name)
The polarity scores for each dialog were generated by VADER, please consult the data cleaning and preparation notebook for more information about this method and its implementation.
The sentiment analysis displays high amounts of Neutral interactions and low amounts of negative and positive for most characters. To better visualize the small differences between those scores we can normalize them.
df_radar = top_chars[['chars', 'positive', 'neutral', 'negative']].copy()
df_radar.columns = ['chars', 'POS', 'NEU', 'NEG']
df_radar.set_index('chars',inplace=True)
normalized_df = df_radar
# normalize
# z-score = ( x - mean ) / standard deviation
normalized_df = (df_radar - df_radar.mean()) / df_radar.std()
# option 2, normalize by range (x - min)/(max - min)
#normalized_df = (df_radar-df_radar.min())/(df_radar.max()-df_radar.min())
normalized_df
To visualize the three normalized variables (positive, negative, and neutral), we'll be using radar charts, with the normalized data we can more easily compare the extents of each polarity in the selected character.
index = normalized_df.index.to_list().index(name)
# define figure size
fig, ax = plt.subplots(1, figsize=(8,8))
# get the fields to a list
categories=list(normalized_df)
N = len(categories)
# Add values to a list and repeat last value to close the triangle
values= normalized_df.values[index].tolist()
values += values[:1]
# calculate the angles and repeat last value to close the circle
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
# define a subplot axis
ax = plt.subplot(111, polar=True, facecolor='#494949')
# draw x lines and labels (neu, pos, neg)
plt.xticks(angles[:-1], categories, color='black', size=12)
# draw circles and labels (25%, 50%, 75%)
ax.set_rlabel_position(0)
plt.yticks([-1.5,0,1.5], ["-1.5","0","1.5"], color="#FFD609",size=13, alpha=1)
plt.ylim(-2.75,2.75)
# Plot data (lines)
ax.plot(angles, values, linewidth=0.6, linestyle='solid', color='black')
# fill triangle
ax.fill(angles, values, color='#0999FF', alpha=1)
# define title and save pic
plt.title(normalized_df.index[index])
plt.savefig(normalized_df.index[index]+'.png', edgecolor='none')
We can also visualize the distribution of the polarity through the episodes; this should allow us to see changes in the character's behavior and outliers that may be worth a closer look.
fig, ax = plt.subplots(1, figsize=(14,8))
x = np.arange(1, len(myplot['ep_name'])+1)
plt.bar(x, myplot.mean_positive, width=1)
plt.bar(x, myplot.mean_neutral, bottom = myplot.mean_positive, color = 'grey', width=1)
plt.bar(x, myplot.mean_negative, bottom = myplot.mean_positive+myplot.mean_neutral, color = 'orange', width=1)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
plt.title(name+' Sentiment by Episode');
In this section, we'll repeat the methods used in '5 - Words Frequency', but this time with a single character. We'll also add a method from spaCy that can help us identify the entities mentioned in the dialogs.
Here we can analyze the most distinguishable terms for a specific character; the font sizes are adjusted so that the more distinguishable the term, the bigger the font.
# print an HTML <h2> tag for each word from the character's distinct terms,
# starting at font-size 60 and reducing the size by 4 for each subsequent word
font_size = 60
for i in df_tfidf[name]:
    display(HTML('<h2 style="font-size:'+str(font_size)+'px;">'+i.upper()+'</h2>'))
    font_size -= 4
Here we're building a word cloud with the most frequent terms the character said, the cleaned version of the text is being used for visualization.
# build a single string with all the text
char_text = ' '.join(df[df.name == name].clean_txt.astype(str).values)
# build the word cloud
wordcloud = WordCloud(width=1800,height=1800,
background_color ='#293F3F',
colormap = "Reds",
#mask = mask,
max_font_size = 250
).generate(char_text.upper())
# adjust and display the figure
fig = plt.figure(figsize = (15, 15), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
Here we'll visualize the most commonly mentioned entities. More specifically, we'll filter for people, organizations, products, locations, and events mentioned in the dialogs, and then count them to find which are mentioned most often by the selected character.
# load model
nlp = en_core_web_sm.load()
#build a single string with all the text and assign it to the spaCy object
doc = nlp('\n'.join(df[df.name == name].text.values))
# get a list of tupples for the identified entities in the text
ent_list = [(ent.text, ent.label_) for ent in doc.ents]
# Build a dataframe
df_ent = pd.DataFrame(ent_list)
df_ent.columns = ['name', 'type']
df_ent['count'] = np.ones(len(ent_list))
# Filter types, sort the values and display the top 15 entities mentioned
types = ['PERSON', 'ORG', 'PRODUCT', 'LOC', 'EVENT']
df_ent_filtered = df_ent[df_ent['type'].isin(types)]
df_ent_filtered = df_ent_filtered.groupby(['name','type']).count().sort_values('count', ascending = False)[:15]
# display top 15 most mentioned entities
df_ent_filtered
Regarding Michael, we can see something common across the word and term frequencies: they're all strongly related to people.
In the TF-IDF scores, Michael's top 10 most distinguishable words include 2 pronouns (Everybody and Somebody) and 5 names. In the bag-of-words word cloud it's harder to spot the pattern, since there are many meaningless words, but we can still see lots of names and pronouns related to people.
The strongest evidence of this is the list of the most frequent entities mentioned by Michael: of the 15 entries displayed, only one is not a person, and that exception is actually the name of their city. This suggests Michael is someone whose biggest interests are people and the community.
https://www.youtube.com/watch?v=vrPgsrfZWOU&feature=youtu.be&t=327
Here we can verify the correlation (Pearson method) between the previously analysed measures and the actual episode ratings.
char_df = build_char_df(name)
corr = char_df.corr(method='pearson')
ax = plt.axes()
happiness_corr = sb.heatmap(corr.iloc[:-1,-1:], vmin=-0.6, vmax=0.6, ax = ax)
fig = happiness_corr.get_figure()
ax.set_title(name)
plt.show()
We can also compare any given variable with the actual ratings; this helps us visualize how related those values are.
var = 'dialogs'
print('The options are:')
print(['dialogs', 'mean_sent', 'mean_words',
'mean_positive', 'mean_negative', 'mean_neutral',
'mean_compound', 'total_sent', 'total_words'])
print('\nSelected variable: '+var)
fig, ax = plt.subplots(1,figsize=(18,10))
plt.title(name)
x = np.arange(1, len(myplot['ep_name'])+1)
xnew = np.linspace(x.min(), x.max(),50) #50 represents number of points to make between T.min and T.max
spl = make_interp_spline(x, myplot[var], k=3) #BSpline object
power_smooth = spl(xnew)
plt.plot(xnew, power_smooth, linewidth=2)
plt.ylabel(var)
plt.xlabel('Episode')
plt.legend([var], loc = 'upper left')
ax2 = ax.twinx()
x = np.arange(0, len(myplot['ratings']))
xnew = np.linspace(x.min(), x.max(),50) #50 represents number of points to make between T.min and T.max
spl = make_interp_spline(x, myplot['ratings'], k=3) #BSpline object
power_smooth = spl(xnew)
plt.plot(xnew, power_smooth, color='red', linewidth=2)
plt.ylabel('Rating')
plt.legend(['Ratings'], loc = 'upper right')
plt.show()